LOSSLESS COMPRESSION AND ALPHABET SIZE

Author

  • DANIEL A. NAGY
Abstract

Lossless data compression by exploiting redundancy in a sequence of symbols is a well-studied field in computer science and information theory. One way to achieve compression is to model the data statistically and estimate the model parameters. In practice, most general-purpose data compression algorithms model the data as stationary sequences of 8-bit symbols. While this model fits currently used computer architectures and the vast majority of information representation standards very well, other models may have both computational and information-theoretic merits: they may be more efficient to implement or fit some data more closely. In addition, compression algorithms based on the 8-bit symbol model perform very poorly on data represented by binary sequences that are not aligned with byte boundaries, either because the fixed symbol length is not a multiple of 8 bits (e.g. DNA sequences) or because the symbols of the source are encoded into bit sequences of variable length. Throughout this thesis, we assume that the source alphabet consists of equal-sized blocks of elementary symbols (typically bits), and we address the impact of this block size on lossless compression algorithms in general and on so-called block-sorting compression algorithms in particular. These algorithms are popular in both theory and practice and have been the subject of intensive research, with many interesting results in recent years. We show that compression at the bit level is tolerant of sources that are not aligned to byte boundaries, while performing reasonably well for byte-aligned sources.
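
The effect of block size on a statistical model can be illustrated with a small experiment. The following is only a sketch under stated assumptions, not the method of the thesis: it uses a hypothetical memoryless source of 3-bit symbols (the weights are invented for illustration), packs the symbols into a bit string, and computes the order-0 empirical entropy per input bit when that same bit string is read as 1-bit, 3-bit, or 8-bit blocks. The helper names `to_bits` and `entropy_per_bit` are made up for this example.

```python
import math
import random
from collections import Counter


def to_bits(symbols, width):
    """Concatenate fixed-width symbols into one string of '0'/'1' characters."""
    return "".join(format(s, f"0{width}b") for s in symbols)


def entropy_per_bit(bits, block):
    """Order-0 empirical entropy of `bits` read as `block`-bit symbols,
    expressed in bits of entropy per input bit."""
    usable = len(bits) - len(bits) % block  # drop an incomplete trailing block
    counts = Counter(bits[i:i + block] for i in range(0, usable, block))
    total = sum(counts.values())
    h = -sum(c / total * math.log2(c / total) for c in counts.values())
    return h / block


random.seed(0)
# Assumption: a memoryless source of 3-bit symbols with a skewed distribution.
symbols = random.choices(range(8), weights=[40, 20, 10, 10, 8, 6, 4, 2], k=30000)
bits = to_bits(symbols, 3)

# Compare the same data under three alphabet choices.
for block in (1, 3, 8):
    print(f"block size {block}: {entropy_per_bit(bits, block):.3f} bits per input bit")
```

For this particular source, the 3-bit reading matches the true symbol boundaries and should give the lowest figure, while the 8-bit reading straddles symbol boundaries at shifting offsets. An order-0 figure is only a rough proxy: practical compressors, including block-sorting ones, also exploit context.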

Similar resources

Procedures of extending the alphabet for the PPM algorithm

This paper presents the lossless PPM (Prediction by Partial string Matching) algorithm and studies how the alphabet can be extended for PPM encoding so that it allows the use of symbols that are not present in the alphabet at the beginning of the encoding phase. The extended alphabet can contain symbols larger than a byte. The paper presents the manner to e...

A Fast and Efficient Nearly-Optimal Adaptive Fano Coding Scheme

Adaptive coding techniques have been increasingly used in lossless data compression. They are suitable for a wide range of applications, in which on-line compression is required, including communications, internet, e-mail, and e-commerce. In this paper, we present an adaptive Fano coding method applicable to binary and multi-symbol code alphabets. We introduce the corresponding partitioning pro...

Recent results in combined coding for word-based PPM

This paper presents the lossless PPM (Prediction by Partial string Matching) algorithm and studies how the extended alphabet can be used for PPM encoding so that it allows the use of symbols that are not present in the alphabet at the beginning of the encoding phase. The extended alphabet can contain symbols larger than a byte and at the decoding external w...

Universal Lossless Compression with Unknown Alphabets - The Average (arXiv:cs/0603068v1 [cs.IT], 17 Mar 2006)

Universal compression of patterns of sequences generated by independently identically distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alp...

Grammar-based codes: A new class of universal lossless source codes

We investigate a type of lossless source code called a grammar-based code, which, in response to any input data string x over a fixed finite alphabet, selects a context-free grammar G_x representing x in the sense that x is the unique string belonging to the language generated by G_x. Lossless compression of x takes place indirectly via compression of the production rules of the grammar G_x. It is shown that, su...

Publication date: 2006